[Core] Fix deadlock in garbage collection when holding lock#60014
[Core] Fix deadlock in garbage collection when holding lock#60014edoakes merged 2 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a deferred release mechanism using a background thread to fix a critical deadlock during garbage collection. The approach is sound, and the new test case effectively reproduces the issue and validates the fix. I have identified a potential resource leak where object IDs in the release batch may not be flushed upon worker shutdown. Additionally, I've suggested an improvement to the close method to log a warning if the release thread doesn't terminate gracefully, which will help in debugging potential future issues.
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
f66a1cb to
cb6e929
Compare
edoakes
left a comment
There was a problem hiding this comment.
Thanks for working on this @RedGrey1993!
755b4b0 to
5f4a5b5
Compare
Signed-off-by: redgrey1993 <ulyer555@hotmail.com>
5f4a5b5 to
5f0e7bb
Compare
|
@edoakes Thanks for the review. I've updated the code according to your suggestions. Please review again at your convenience. |
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <ulyer555@hotmail.com> Co-authored-by: redgrey1993 <ulyer555@hotmail.com> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <ulyer555@hotmail.com> Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <ulyer555@hotmail.com> Co-authored-by: redgrey1993 <ulyer555@hotmail.com> Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
…ect#60014) ## Description This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers `ClientObjectRef.__del__()` while the DataClient lock is held. When using Ray Client, a deadlock can occur in the following scenario: 1. Main thread acquires DataClient.lock (e.g., in _async_send()) 2. Garbage collection is triggered while holding the lock 3. GC calls `ClientObjectRef.__del__()` 4. `__del__()` attempts to call call_release() → _release_server() → DataClient.ReleaseObject() 5. ReleaseObject() tries to acquire the same DataClient.lock 6. Deadlock: The same thread tries to acquire a non-reentrant lock it already holds ## Related issues > Fixes ray-project#59643 ## Additional information This PR implements a deferred release pattern that completely avoids the deadlock: 1. Deferred Release Queue: Introduces _release_queue (a thread-safe queue.SimpleQueue) to collect object IDs that need to be released 2. Background Release Thread: Adds _release_thread that processes the release queue asynchronously 3. Non-blocking `__del__`: `ClientObjectRef.__del__()` now only puts IDs into the queue (no lock acquisition) --------- Signed-off-by: redgrey1993 <ulyer555@hotmail.com> Co-authored-by: redgrey1993 <ulyer555@hotmail.com>
Description
This PR fixes a critical deadlock issue in Ray Client that occurs when garbage collection triggers
ClientObjectRef.__del__()while the DataClient lock is held.When using Ray Client, a deadlock can occur in the following scenario:
ClientObjectRef.__del__()__del__()attempts to call call_release() → _release_server() → DataClient.ReleaseObject()Related issues
Additional information
This PR implements a deferred release pattern that completely avoids the deadlock:
__del__:ClientObjectRef.__del__()now only puts IDs into the queue (no lock acquisition)